Duke Statistical Science | Graduation with Distinction
April 11, 2023
| Passage | Learning Objective(s) |
|---|---|
| Storm Paths | modeling; simulation; uncertainty |
| Movie Budgets 1 | compare summary statistics visually |
| Movie Budgets 2 | modeling; \(R^2\); compare trends visually |
| Application Screening | ethics; modeling; proxy variable |
| Banana Conclusions | causation; statistical communication |
| COVID Map | complex visualization; spatial data; time series; sophisticated scales |
| He Said She Said | basic visualization; sophisticated scales |
| Build-a-Plot | data to visualization process |
| Disease Screening | compare classification diagnostics visually |
| Realty Tree | modeling; regression tree; variable selection |
| Website Testing | compare trends visually; uncertainty; modeling; time series; extrapolation |
| Image Recognition | ethics; modeling; representativeness of training data |
| Data Confidentiality | ethics; data deidentification; statistical communication |
| Activity Journal | structure data; store data |
| Movie Wrangling | data cleaning; data wrangling; column-wise string operations; pseudocode; joins |
Start with Application Screening (AS): a question built around a proxy variable.
You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.
Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Why might using this variable be considered unethical? Explain your answer.
Oops, best practices would phrase this in a non-leading way. If a student wasn’t initially going to think this would be unethical, but we told them it might be somehow, their explanation won’t be as valuable as that of someone who would have answered right away. Okay, let’s rephrase to make them answer whether it is or isn’t:
You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.
Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Are there ethical implications of using this variable to select candidates? Explain your answer.
Well… that doesn’t help much. We still have the classic selection bias clouding results; students would think “well, if there wasn’t an ethical problem, they wouldn’t have included this as one of the only ethics questions on the assessment.” Plus, how are we grading this? What are we looking for to confirm that they understand the proxy variable? It might work to set up an autograder that marks “correct” if they mark “yes” to the ethics question AND mention “proxy” in their response. But this is an introductory-level assessment. Will students be able to concisely describe employment experience as a “proxy,” or would explanations be wordier and include phrases like “is correlated with,” “is related to,” “goes hand in hand with,” or “predicts”? If we want the assessment to be autogradable, MC looks like the main way to go. Note that, at this stage, Application Screening and several other similar questions are left in open-ended format to collect this type of data from the pilots.
So, after that whole mess, a general conclusion is that MC might be the only way to go. So how do we write a good MC ethics question? Here’s a start, with the focus on identifiable data and statistical communication:
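To make the autograder idea concrete, here is a minimal base R sketch of keyword-based grading. Everything in it is an assumption for illustration: the `grade_response()` function, the `proxy_phrases` list, and the four example responses are hypothetical and are not part of the actual assessment or its pilot data.

```r
# Hypothetical phrase list: wordings an introductory student might use
# instead of the technical term "proxy".
proxy_phrases <- c(
  "proxy",
  "correlated with",
  "related to",
  "goes hand in hand with",
  "predicts"
)

# Returns TRUE when the student answered "yes" to the ethics question AND
# the explanation contains at least one of the phrases above.
grade_response <- function(answer, explanation, phrases = proxy_phrases) {
  said_yes <- tolower(trimws(answer)) == "yes"
  text <- tolower(explanation)
  mentions <- any(vapply(
    phrases,
    function(p) grepl(p, text, fixed = TRUE),
    logical(1)
  ))
  said_yes && mentions
}

# Made-up responses, for illustration only.
responses <- data.frame(
  answer = c("Yes", "yes", "No", "Yes"),
  explanation = c(
    "Years of experience is a proxy for age.",
    "Experience goes hand in hand with how old someone is.",
    "Salary expectations are a fair business concern.",
    "It just seems unfair."
  ),
  stringsAsFactors = FALSE
)

responses$correct <- mapply(
  grade_response,
  responses$answer,
  responses$explanation,
  USE.NAMES = FALSE
)
```

Even with a longer phrase list, plain substring matching cannot tell “is related to age” from “is not related to age,” which is one more reason the open-ended format is hard to grade automatically.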
A newspaper reports on the results of a survey from a small (<2000 student) college. The college agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Year, major, sports played
b. Year, major
Well, first of all, is “college” the best word here? While it’s roughly synonymous with “university” in the US, they can have very different meanings country-to-country. Thus, let’s eliminate any ambiguity right off the bat.
A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Year, major, sports played
b. Year, major
Great! There is no issue with grading this on a large scale, as students will simply choose option “a” to be marked correct. But how valid is this binary comparison of two nearly identical options in measuring students’ idea of data privacy (and of key variables whose intersections can quickly narrow down populations)? Would they choose “a” simply because of the “presence vs absence” heuristic, similar to the selection bias issue addressed earlier, or because they understand how it quickly narrows down who a respondent could be? This question took a lot of brainstorming and workshopping, and we ultimately landed on the following options:
A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Class year and sports played
b. Student ID and dorm zip code
c. GPA and major
d. Birth date and phone number
e. None of the above
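The point about intersections of key variables narrowing down a population can be sketched with simulated data. This is a rough illustration only: the enrollment size, the number of majors and sports, and the helper names `group_sizes()` and `unique_share()` are all assumptions, not anything taken from the assessment.

```r
set.seed(2023)
n <- 1500  # hypothetical enrollment, under the 2,000-student cap in the question

# Simulated student records; every value here is made up for illustration.
students <- data.frame(
  year  = sample(c("First-year", "Sophomore", "Junior", "Senior"), n, replace = TRUE),
  major = sample(paste("Major", 1:40), n, replace = TRUE),
  sport = sample(c("None", paste("Sport", 1:15)), n, replace = TRUE,
                 prob = c(0.7, rep(0.02, 15))),
  stringsAsFactors = FALSE
)

# For each student, the number of students who share the same values on `vars`.
# A group of size 1 means that combination singles out exactly one student.
group_sizes <- function(data, vars) {
  key <- interaction(data[vars], drop = TRUE)
  ave(seq_len(nrow(data)), key, FUN = length)
}

# Share of students who are unique on the chosen variables.
unique_share <- function(data, vars) {
  mean(group_sizes(data, vars) == 1)
}

share_two   <- unique_share(students, c("year", "major"))
share_three <- unique_share(students, c("year", "major", "sport"))
```

Adding a variable can only split existing groups, never merge them, so `share_three` is always at least as large as `share_two`, whatever the simulated values turn out to be.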
A data scientist at IMDb has been given a dataset comprised of the revenues and budgets for 2,349 movies made between 1986 and 2016.
Suppose they want to compare several distributional features of the budgets among four different genres—Horror, Drama, Action, and Animation. To do this, they create the following plots.
Fill in the following table by placing a check mark in the cells corresponding to the attributes of the data that can be determined by examining each of the plots.
| | Plot A | Plot B | Plot C | Plot D |
|---|---|---|---|---|
| Mean | ☐ | ☐ | ☐ | ☐ |
| Median | ☐ | ☐ | ☐ | ☐ |
| IQR | ☐ | ☐ | ☐ | ☐ |
| Shape | ☐ | ☐ | ☐ | ☐ |
dsbox package
Reference growing DS interest and scalability of education from the assessment talk earlier; that, plus the open-source nature of R, lends itself naturally to making such a standardized curriculum.
What is Data Science in a Box? It’s that ^: using the tidyverse to practice basic data wrangling, visualization, and modeling.
That curriculum set was then condensed into a package for self-learners called dsbox, which users can download and follow to become well-acquainted with basic data science in R.
2 key packages: learnr and gradethis.
learnr provides a robust framework for turning RMarkdown documents into interactive tutorials, where users can be guided through running and writing code, answering quiz questions, watching videos, etc., directly in the “Tutorial” pane in RStudio. A key feature is that progress is saved, so you can resume working in RStudio at any time.
gradethis takes that basic, broad framework, and provides tools for drilling down deeper when grading. Instructors can provide feedback for a variety of common mistakes with sophisticated testing logic.
9 existing tutorials, plus a skeleton for 1 more (these corresponded to all HWs from DSinaBox; the package already had skeleton .Rd files for the dataset).
Decided I would try to recreate that tutorial, adding in some flair and thoughts based on best practices.
```{r common-themes, exercise = TRUE}
lego_sales |>
  ___(___)
```
```{r common-themes-hint-1}
Look at the previous question for help!
```
```{r common-themes-solution}
lego_sales |>
  count(theme, sort = TRUE)
```
```{r common-themes-check}
grade_this({
  # Correct: the most common theme is in the first row.
  if (identical(as.character(.result[1, 1]), "Star Wars")) {
    pass("You have counted themes and sorted the counts correctly.")
  }
  # Common mistakes, each identified by what lands in the first row.
  if (identical(as.character(.result[1, 1]), "Advanced Models ")) {
    fail("Did you forget to sort the counts in descending order?")
  }
  if (identical(as.character(.result[1, 1]), "Classic")) {
    fail("Did you accidentally sort the counts in ascending order?")
  }
  if (identical(as.character(.result[1, 1]), "Adventure Camp")) {
    fail("Did you count subthemes instead of themes?")
  }
  if (identical(as.numeric(.result[1, 2]), 172)) {
    fail("Did you count subthemes instead of themes?")
  }
  # Fallback for any other result.
  fail("Not quite. Take a peek at the hint!")
})
```
Explaining what CRAN is
Explaining what the DESCRIPTION file and dependencies are
Unfortunately, gradethis is still in development and thus not yet released on CRAN itself. In turn, we are unable to upload to CRAN a package that lists a non-CRAN package as one of its dependencies. We have submitted an issue
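The dependency problem can be illustrated with a small base R sketch that reads the `Imports` field of a DESCRIPTION file and flags anything not on CRAN. The DESCRIPTION contents, the package name `mytutorials`, and the `on_cran` vector are hypothetical stand-ins; `on_cran` simply mirrors the situation described above rather than querying CRAN.

```r
# A hypothetical DESCRIPTION file; the package name and fields are illustrative.
description_text <- c(
  "Package: mytutorials",
  "Title: Example Tutorial Package",
  "Version: 0.1.0",
  "Imports: learnr, gradethis, dplyr"
)

# Stand-in for the set of package names available on CRAN. In practice this
# could come from rownames(available.packages()), which needs network access.
on_cran <- c("learnr", "dplyr", "ggplot2")

# DESCRIPTION files use the Debian control format, which read.dcf() parses.
con <- textConnection(description_text)
desc <- read.dcf(con, fields = "Imports")
close(con)

# Split the comma-separated field into individual package names.
imports <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])

# Dependencies that would block a CRAN submission under this stand-in list.
missing_from_cran <- setdiff(imports, on_cran)
missing_from_cran
```

This sketch does not handle version constraints such as `dplyr (>= 1.0.0)`; those would need to be stripped before comparing names.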
Learned advanced computing I wouldn’t have gotten otherwise in my abbreviated trip through the Stat major
Learned how to interact with others’ code beyond the scope of a classroom or research team (making and reviewing public pull requests; compiling, standardizing, and revising the assessment and package)
The statement that “teaching material is the only way to master it” had always held true for my tutoring and TAing experiences; developing and studying a curriculum to the point of scrutiny took my understanding to the next level
Newfound appreciation for work that has gone into all the educational curriculum materials today, tools like ghclass and learnr, all packages that have been released and are maintained and what it takes to do that.
Inspired me to continue interacting with the world of open source software even though my job (for the meantime) is to be an Excel monkey for 40 hours a week
Browse at your own pace at https://evandragich.github.io/thesis-work/
Email me at emd48@duke.edu